“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the sixth and final lecture in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.
This week will focus on commonly used visualizations related to genomic information. When working with sequencing data, you may commonly wish to compare sequences based on their relationships or relative similarity (trees), by their sequence identity in gene families or potential interactions (graphs), and or more directly their sequence motifs.
At the end of this lecture you will have covered the following topics
grey background - a package, function, code, command or
directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or
folder
bold - heading or a term that is being defined
blue text - named or unnamed
hyperlink
... - Within each coding cell this will indicate an area
of code that students will need to complete for the code cell to run
correctly.
Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.
Each week, new lesson files will appear within your RStudio folders.
We are pulling from a GitHub repository using this Repository
git-pull link. Simply click on the link and it will take you to the
University of Toronto datatools
Hub. You will need to use your UTORid credentials to complete the
login process. From there you will find each week’s lecture files in the
directory /2024-03-Adv_Graphics_R/Lecture_XX. You will find
a partially coded skeleton.Rmd file as well as all of the
data files necessary to run the week’s lecture.
Alternatively, you can download the R-Markdown Notebook
(.Rmd) and data files from the RStudio server to your
personal computer if you would like to run independently of the Toronto
tools.
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
Today’s datasets will focus on SARS-CoV-2 variant surveillance data from the which has been tracking published sequenced genomes for the appearance of new strains in North America.
This is a Newick format data set describing a phylogenetic tree of SARS-CoV-2 strain information.
Metadata that accompanies the first dataset. It links strain information back to as much geographical and related information as possible.
tidyverse which has a number of packages including
dplyr, tidyr, stringr,
forcats and ggplot2
viridis helps to create color-blind palettes for our
data visualizations
RColorBrewer has some hlepful palettes that we’ll need
to colour our data.
ggnewscale will be helpful in generating multiple colour
palettes across increasingly complex plots.
treeio, tidytree, and ggtree
will be used to help import, parse and plot phylogenetic trees.
tidygraph and ggraph will be used to
generate network graph objects and plot them.
ggseqlogo is used for generating sequence logos related
to sequence motifs.
qqman is a wrapper package for generating Manhattan
plots.
lubridate and zoo help us to work with some
date-based information.
# When do we start installing the packages
Sys.time()
[1] "2024-04-18 14:17:19 EDT"
# New bioconductor packages we haven't worked with before
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ggnewscale", update = FALSE)
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://cran.r-project.org
Bioconductor version 3.12 (BiocManager 1.30.22), R 4.0.5 (2021-03-31)
Warning: package(s) not installed when version(s) same as or greater than current; use
`force = TRUE` to re-install: 'ggnewscale'
BiocManager::install("ggtree", update = FALSE)
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://cran.r-project.org
Bioconductor version 3.12 (BiocManager 1.30.22), R 4.0.5 (2021-03-31)
Warning: package(s) not installed when version(s) same as or greater than current; use
`force = TRUE` to re-install: 'ggtree'
BiocManager::install("ggtreeExtra", update = FALSE)
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://cran.r-project.org
Bioconductor version 3.12 (BiocManager 1.30.22), R 4.0.5 (2021-03-31)
Warning: package(s) not installed when version(s) same as or greater than current; use
`force = TRUE` to re-install: 'ggtreeExtra'
# New CRAN packages we haven't worked with before
install.packages("tidytree")
Warning: package 'tidytree' is in use and will not be installed
install.packages("ggseqlogo")
Warning: package 'ggseqlogo' is in use and will not be installed
install.packages("qqman")
Warning: package 'qqman' is in use and will not be installed
install.packages("ggraph")
Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
(as 'lib' is unspecified)
also installing the dependencies 'igraph', 'tidygraph', 'graphlayouts'
There are binary versions available but the source versions are later:
binary source needs_compilation
igraph 1.3.0 2.0.3 TRUE
tidygraph 1.2.1 1.3.1 TRUE
graphlayouts 0.8.0 1.1.1 TRUE
ggraph 2.0.5 2.2.1 TRUE
installing the source packages 'igraph', 'tidygraph', 'graphlayouts', 'ggraph'
Warning in install.packages("ggraph"): installation of package 'igraph' had
non-zero exit status
Warning in install.packages("ggraph"): installation of package 'tidygraph' had
non-zero exit status
Warning in install.packages("ggraph"): installation of package 'graphlayouts'
had non-zero exit status
Warning in install.packages("ggraph"): installation of package 'ggraph' had
non-zero exit status
install.packages("tidygraph")
Installing package into 'C:/Users/mokca/AppData/Local/R/win-library/4.0'
(as 'lib' is unspecified)
also installing the dependency 'igraph'
There are binary versions available but the source versions are later:
binary source needs_compilation
igraph 1.3.0 2.0.3 TRUE
tidygraph 1.2.1 1.3.1 TRUE
installing the source packages 'igraph', 'tidygraph'
Warning in install.packages("tidygraph"): installation of package 'igraph' had
non-zero exit status
Warning in install.packages("tidygraph"): installation of package 'tidygraph'
had non-zero exit status
# When do we finish?
Sys.time()
[1] "2024-04-18 14:19:10 EDT"
# Packages to help tidy our data
library(tidyverse)
library(readxl)
# Packages for the graphical analysis section
library(viridis)
library(RColorBrewer)
library(ggnewscale)
# Packages for today's lecture about trees
library(tidytree)
library(ggtree)
Error: package or namespace load failed for 'ggtree' in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
there is no package called 'aplot'
library(ggtreeExtra)
library(treeio)
# Packages for graphs
library(ggraph)
Error in library(ggraph): there is no package called 'ggraph'
library(tidygraph)
Error in library(tidygraph): there is no package called 'tidygraph'
# Packages for sequence logos
library(ggseqlogo)
# Packages for Manhattan plots
library(qqman)
# Date calculation helpers
library(lubridate)
library(zoo)
Dendrograms, cladograms, and phylogenetic trees all share a similar structure with some modifications. All are represented in branching tree structures that are used to represent the relationships among the leaves, also known as tips. Branches along the tree may also be referred to as edges.
Leaves can represent different species, strains, sequences or multi-dimensional values. Leaves are connected by branches to their nearest neighbours or relatives. As you move backwards along the tree, you encounter internal nodes which can connect directly to more tips or more nodes. The distance between tips and nodes on a phylogenetic tree, represent a relative distance that may be defined as some type of evolutionary distance, time, or simple euclidean distance. In a cladogram, however, distances along branches have no meaning and only define the presence of a relationship.
In tree terminology, it is helpful to define a few more terms:
| Term | Description |
|---|---|
| Root | A tree is rooted when it defines a common ancestor for all tips in the tree. This could be considered the start or entry point for the tree |
| Most recent common ancester (MRCA) | This is an internal node that collectively joins two or more tips. |
| Clade | A group of taxa (tips) that includes a common ancestor and all of its descendants. |
| Scaling | This indicates that the branch lengths do represent a distance metric, whereas, an unscaled tree may have even-length branches not representative of evolutionary/relationship distances. |
Diagram of phylogenetic tree terms from 10 differences between cladograms and phylogenetics trees
As we saw last week, clustering analysis such as that from
hclust() can create dendrogram data. We plotted this in a
couple of ways by itself using fviz_dend() from the
factoextra package and with Heatmap() from the
ComplexHeatmap package. In the case of our first foray, we
were looking at the “relationships” between samples by indicating how
similar they were based on the characteristics in their various
features.
Depending on the software used, trees can be represented in a number of formats, some of which are described below.
| Format | Description |
|---|---|
| Newick | The standard used to represent trees in a computer-readable format. Trees are encoded in a parentheses-closure format where each tip takes the form of “taxa:branch-length” |
and pairs of tips are separated by a comma
, and enclosed by parentheses (). Internal
nodes branch lengths are defined outside the parentheses with
“():branch-length” |
|
| NEXUS/phylip | Incorporates the Newick tree with separate related data that is partitioned into different blocks. Blocks are started with “BEGIN <BLOCK NAME>;” and closed with “END;” |
| NHX | New Hampshire eXtended format is also based on Newick format but instead of code blocks uses a tagging system for each node extending the format to “taxa:branch-length[&&NHX:<tag information>]” |
read.tree() from the treeio
packageWhen opening tree files, depending on the format, you can use the
treeio package and one of it’s many parsing
functions but the simplest way to avoid figuring out how to open a
tree file, is to just use the read.tree() function. This
will determine the appropriate file type and parse through the many
different tree formats before depositing the data into a
phylo object.
Today we’ll be working with a dataset from the Auspice COVID-19 North American dataset maintained by Finlay Maguire. Unfortunately this dataset is no longer available but we have an older copy with a fair amount of metadata.
Let’s begin by opening up the tree data and some related metadata before taking a look under the hood.
# Import our Newick Tree
SC2_variants_time.phylo <- read.tree("./data/nextstrain_ncov_north-america_canada_timetree.nwk")
# Import our metadata
SC2_metadata.df <- read_tsv("data/nextstrain_ncov_north-america_canada_metadata.tsv")
[1mRows: [22m[34m6627[39m [1mColumns: [22m[34m24[39m
[36m--[39m [1mColumn specification[22m [36m-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------[39m
[1mDelimiter:[22m "\t"
[31mchr[39m (21): Strain, GISAID clade, Nextstrain clade, Clade, Country, Admin Div...
[32mdbl[39m (1): Age
[33mlgl[39m (1): url
[34mdate[39m (1): Collection Data
[36mi[39m Use `spec()` to retrieve the full column specification for this data.
[36mi[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
# What does the tree look like?
str(SC2_variants_time.phylo)
List of 6
$ edge : int [1:12776, 1:2] 6628 6628 6629 6630 6631 6630 6632 6632 6633 6629 ...
$ edge.length: num [1:12776] 0.0757 0.0848 0.0336 0.0501 0 ...
$ Nnode : int 6150
$ node.label : chr [1:6150] "NODE_0008606" "NODE_0000001" "NODE_0000061" "USA/MA1/2020_travel_history" ...
$ tip.label : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
$ root.edge : num 2020
- attr(*, "class")= chr "phylo"
- attr(*, "order")= chr "cladewise"
SC2_variants_time.phylo
Phylogenetic tree with 6627 tips and 6150 internal nodes.
Tip labels:
Wuhan/WH01/2019, USA/MA1/2020, Canada/ON_ON-VIDO-01-2/2020, Canada/ON_VIDO-01/2020, USA/CA-CDC-5/2020, Taiwan/CGMH-CGU-02/2020, ...
Node labels:
NODE_0008606, NODE_0000001, NODE_0000061, USA/MA1/2020_travel_history, NODE_0000062, Canada/ON_VIDO-01/2020_travel_history, ...
Rooted; includes branch lengths.
phylo object is a simple listFrom the looks of our structure, after reading in our Newick file, we
have a list with six elements. They are pretty clearly named with edge
information (a matrix of paired numbers), edge lengths, a count of the
number of nodes (ie internal points where branches bifurcate.
While mostly just numbers, some of these node.label values
appear to be based on travel history information. All of the 6,627
tip.label values correspond to 6,627 SARS-CoV-2 strain
names found in the metadata file as well.
Suppose we want to combine some information from our metadata with our tree? We can then use this information to colour, shape, and bring more visual order to our tree. Working with the tree, however, can be a little hard given that we have lists of different lengths representing different portions of the tree.
In our case, we only have tree tip information that we want to add to
and the easiest way to work would be if it was in a table format. The
package tidytree provides a function as_tibble
that will parse and convert the phylo format to a
tbl_tree object, which is a kind of tidy data frame.
Let’s convert our variant phylo object to something we can work with.
# Convert the phylo object using as_tibble
SC2_variants_time.tb <- as_tibble(SC2_variants_time.phylo)
# Check out the structure now
str(SC2_variants_time.tb)
tbl_tree [12,777 x 4] (S3: tbl_tree/tbl_df/tbl/data.frame)
$ parent : int [1:12777] 6628 6631 6632 6633 6636 6635 6638 6638 6639 6640 ...
$ node : int [1:12777] 1 2 3 4 5 6 7 8 9 10 ...
$ branch.length: num [1:12777] 0.0757 0 0.0077 0 0 ...
$ label : chr [1:12777] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
head(SC2_variants_time.tb)
As you can see our SC2_variants_time.tb object is a
simple data frame now represented as a 4-column table with 12,777 rows.
The edge data is encoded between the parent and
node columns with branch.length to define the
length of each edge. We also have all of the tip and node names stored
under the label column. With the tree in this format, we
can now treat the tree like a tibble and join additional information to
the tree.
Rules to keep in mind:
Don’t lose any node information from your original tree structure. Losing “observations” is essentially losing edges of your tree!
Try to work with a complete dataset of information although sometimes it doesn’t make sense that you would have it all.
Convert your tibble back to a tree format when you’re done with
as.treedata()
You may have information about the leaves or tips of your tree but by
default there is no real internal node information when it comes to
sample origins or lineage. When joining information to the tables, use
the correct direction *_join() to avoid data loss. A
full_join() is probably the safest.
Let’s take a quick look at our metadata and identify what we’re interested in.
# Look at the metadata, which bits of information would be nice to add?
str(SC2_metadata.df, give.attr = FALSE)
spc_tbl_ [6,627 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
$ GISAID clade : chr [1:6627] "L" "L" "L" "L" ...
$ Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
$ Age : num [1:6627] 44 21 NA NA 54 NA 47 47 23 55 ...
$ Clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
$ Country : chr [1:6627] "Asia" "USA" "Canada" "Canada" ...
$ Admin Division : chr [1:6627] "Asia" "Massachusetts" "Ontario" "Ontario" ...
$ Admin Division of exposure: chr [1:6627] "Asia" "Asia" "Ontario" "Asia" ...
$ gisaid_epi_isl : chr [1:6627] "EPI_ISL_406798" "EPI_ISL_409067" "EPI_ISL_425177" "EPI_ISL_413015" ...
$ Host : chr [1:6627] "Human" "Human" "Human" "Human" ...
$ Originating Lab : chr [1:6627] "General Hospital of Central Theater Command of People's Liberation Army of China" "Massachusetts Department of Public Health" "Public Health Ontario" "Public Health Ontario Laboratory" ...
$ PANGO lineage : chr [1:6627] "B" "B" "B" "B" ...
$ Submission Date : chr [1:6627] "Older" "Older" "Older" "Older" ...
$ Region : chr [1:6627] "Asia" "North America" "North America" "North America" ...
$ Region of exposure : chr [1:6627] "Asia" "Asia" "North America" "Asia" ...
$ Sex : chr [1:6627] "Male" "Male" "Male" "Male" ...
$ strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
$ Emerging Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
$ Submitting Lab : chr [1:6627] "BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Aca"| __truncated__ "Pathogen Discovery, Respiratory Viruses Branch, Division of Viral Diseases, Centers for Disease Control and Prevention" "Public Health Agency of Canada - National Microbiology Laboratory" "National Microbiology Laboratory" ...
$ url : logi [1:6627] NA NA NA NA NA NA ...
$ Collection Data : Date[1:6627], format: "2019-12-26" "2020-01-29" ...
$ Author : chr [1:6627] "Weijun Chen et al (https://dx.doi.org/10.1016/S0140-6736(20)30251-8)" "Clinton R. Paden et al (https://dx.doi.org/10.1101/2020.03.09.20032896)" "Amrit S. Boese et al" "Shari Tyson et al" ...
$ Country of exposure : chr [1:6627] NA "Asia" NA "Asia" ...
$ Location : chr [1:6627] NA "Boston" NA NA ...
# Check out some of the values for these variables
# Metadata information on the samples
unique(SC2_metadata.df$Country)
[1] "Asia" "USA" "Canada"
[4] "Europe" "Guatemala" "Panama"
[7] "Africa" "Dominican Republic" "Costa Rica"
[10] "South America" "Oceania" "Jamaica"
[13] "Mexico" "Barbados" "Guadeloupe"
[16] "Saint Martin" "Bermuda" "Sint Maarten"
[19] "Saint Barthélemy" "Belize" "Saint Lucia"
[22] "El Salvador" "Cuba"
Asia
USA
Canada
Europe
Guatemala
Panama
Africa
Dominican Republic
Costa Rica
South America
Oceania
Jamaica
Mexico
Barbados
Guadeloupe
Saint Martin
Bermuda
Sint Maarten
Saint Barthélemy
Belize
Saint Lucia
El Salvador
Cuba
unique(SC2_metadata.df$Age)
[1] 44 21 NA 54 47 23 55 62 49 50 43 25 26 33 38 9 73 64
[19] 57 58 0 28 70 71 20 93 36 37 40 56 89 45 34 53 94 95
[37] 86 84 60 90 27 30 59 41 46 52 51 32 24 35 29 61 79 66
[55] 78 11 2 65 75 31 48 74 68 67 63 77 42 22 12 5 17 80
[73] 16 39 82 8 7 88 76 69 19 87 3 14 18 1 81 15 6 10
[91] 85 4 98 91 72 99 13 92 83 96 104
# Metadata information on the genomes
unique(SC2_metadata.df$'GISAID clade')
[1] "L" "S" "O" "V" "GR" "G" "GV" "GH" "GRY" NA
L
S
O
V
GR
G
GV
GH
GRY
NA```
<!-- rnb-output-end -->
<!-- rnb-source-begin eyJkYXRhIjoidW5pcXVlKFNDMl9tZXRhZGF0YS5kZiQnTmV4dHN0cmFpbiBjbGFkZScpIn0= -->
```r
unique(SC2_metadata.df$'Nextstrain clade')
[1] "19A" "19B" NA "20A" "20B"
[6] "20E (EU1)" "20A.EU2" "20C" "20G" "20H/501Y.V2"
[11] "20D" "20J/501Y.V3" "20F" "20I/501Y.V1"
19A
19B
NA
20A
20B
20E (EU1)
20A.EU2
20C
20G
20H/501Y.V2
20D
20J/501Y.V3
20F
20I/501Y.V1
unique(SC2_metadata.df$Clade)
[1] "19A" "19B" "20A" "20E (EU1)" "20C"
[6] "20G" "20H/501Y.V2" "20B" "20D" "20J/501Y.V3"
[11] "20F" "20I/501Y.V1"
19A
19B
20A
20E (EU1)
20C
20G
20H/501Y.V2
20B
20D
20J/501Y.V3
20F
20I/501Y.V1
unique(SC2_metadata.df$'PANGO lineage')
[1] "B" "A" "A.6" "A.2" "A.2.4"
[6] "A.2.5" "A.5" "A.1" "A.21" "A.18"
[11] "A.12" "A.28" "A.19" "A.25" "A.23"
[16] "A.23.1" "B.12" "B.56" "B.4.7" "B.31"
[21] "B.61" "B.40" "B.35" "B.28" "B.11"
[26] "B.55" "B.27" "B.3" "B.4" "B.4.8"
[31] "B.4.2" "B.53" "B.6" "B.1" "B.6.6"
[36] "B.6.1" "B.6.2" "B.6.8" "B.1.256" "B.1.1"
[41] "B.1.1.420" "B.1.240" "B.1.205" "B.1.128" "B.1.221"
[46] "B.1.223" "B.1.214.2" "B.1.214" "B.1.146" "B.1.147"
[51] "B.1.104" "B.1.391" "B.1.380" "B.1.203" "B.1.540"
[56] "B.1.416" "B.1.600" "B.1.160" "B.1.195" "B.1.609"
[61] "B.1.561" "B.1.232" "B.1.558" "B.1.239" "B.1.237"
[66] "B.1.597" "B.1.480" "B.1.397" "B.1.408" "B.1.243"
[71] "B.1.240.2" "B.1.240.1" "B.1.524" "B.1.179" "B.1.1.298"
[76] "B.1.400" "B.1.398" "B.1.236" "B.1.234" "B.1.459"
[81] "B.1.131" "B.1.525" "B.1.563" "B.1.411" "B.1.379"
[86] "B.1.177.31" "B.1.177" "B.1.177.73" "B.1.177.11" "B.1.177.14"
[91] "B.1.177.15" "AA.1" "B.1.177.44" "B.1.177.39" "B.1.177.40"
[96] "B.1.177.35" "B.1.177.82" "B.1.177.87" "B.1.177.75" "B.1.177.45"
[101] "B.1.177.28" "B.1.177.32" "B.1.177.37" "B.1.177.72" "B.1.117.2"
[106] "B.1.177.4" "B.1.177.53" "W.2" "W.1" "B.1.177.50"
[111] "B.1.177.81" "B.1.177.58" "B.1.177.55" "B.1.177.56" "B.1.177.57"
[116] "B.1.177.51" "B.1.177.52" "B.1.177.21" "B.1.177.77" "B.1.177.6"
[121] "B.1.177.7" "B.1.177.86" "B.1.177.8" "B.1.177.62" "B.1.177.60"
[126] "U.3" "B.1.177.79" "B.1.9.5" "B.1.9" "B.1.439"
[131] "B.1.111" "B.1.162" "B.1.438" "B.1.164" "B.1.36"
[136] "B.1.441" "B.1.36.35" "B.1.468" "B.1.462" "B.1.471"
[141] "B.1.469" "B.1.466" "B.1.466.2" "B.1.470" "B.1.160.16"
[146] "B.1.456" "B.1.36.29" "B.1.36.31" "B.1.36.1" "B.1.36.8"
[151] "B.1.36.26" "B.1.36.34" "B.1.36.7" "B.1.36.16" "B.1.36.36"
[156] "B.1.36.38" "B.1.36.37" "B.1.36.18" "B.1.36.10" "B.1.476"
[161] "B.1.281" "B.1.273" "B.1.544" "B.1.270" "B.1.277"
[166] "B.1.279" "B.1.314" "B.1.367" "B.1.370" "B.1.413"
[171] "B.1.316" "B.1.3" "B.1.509" "B.1.504" "B.1.362"
[176] "B.1.499" "B.1.615" "B.1.611" "B.1.505" "B.1.422"
[181] "B.1.596.1" "B.1.362.2" "B.1.369" "B.1.564" "B.1.564.1"
[186] "B.1.369.1" "B.1.1.1" "B.1.428" "B.1.356" "B.1.601"
[191] "B.1.562" "B.1.360" "B.1.311" "B.1.598" "B.1.350"
[196] "B.1.426" "B.1.265" "B.1.336" "B.1.589" "B.1.567"
[201] "B.1.588.1" "B.1.595" "B.1.596" "B.1.2" "B.1.349"
[206] "B.1.366" "B.1.576" "B.1.603" "B.1.526.2" "B.1.526"
[211] "B.1.291" "B.1.575" "B.1.428.3" "B.1.428.1" "B.1.428.2"
[216] "B.1.427" "B.1.429" "B.1.495" "B.1.517" "B.1.324"
[221] "B.1.298" "B.1.573" "B.1.320" "B.1.361" "B.1.582"
[226] "B.1.612" "B.1.351" "B.1.1.255" "B.1.1.430" "B.1.1.306"
[231] "AE.7" "AE.2" "AE.5" "AE.6" "AE.4"
[236] "AE.8" "B.1.153" "B.1.329" "B.1.535" "B.1.313"
[241] "B.1.170" "B.1.22.1" "B.1.201" "B.1.227" "B.1.415"
[246] "B.1.94" "B.1.1.519" "B.1.1.222" "B.1.552" "B.1.199"
[251] "B.1.178" "B.1.220" "B.1.6" "B.1.84" "B.1.1.170"
[256] "B.1.93" "B.1.91" "B.1.219" "B.1.211" "B.1.388"
[261] "B.1.210" "B.1.555" "B.1.206" "B.1.260" "B.1.420"
[266] "B.1.192" "B.1.258" "B.1.258.17" "B.1.242" "B.1.126"
[271] "B.1.378" "B.1.565" "B.1.139" "B.1.8" "B.1.78"
[276] "B.1.396" "B.1.560" "B.1.527" "B.1.142" "B.1.393"
[281] "B.1.187" "B.1.149" "B.1.549" "B.1.545" "B.1.189"
[286] "B.1.528" "B.1.23" "B.1.67" "B.1.547" "P.2"
[291] "B.1.1.28" "B.1.390" "B.1.395" "B.1.530" "C.35"
[296] "B.1.233" "B.1.543" "B.6.7" "B.1.1.369" "B.1.1.372"
[301] "C.8" "C.36" "C.36.1" "C.16" "C.23"
[306] "C.1.1" "B.1.1.375" "C.2.1" "C.12" "B.1.1.161"
[311] "B.1.1.87" "B.1.1.192" "B.1.1.71" "B.1.1.10" "L.3"
[316] "L.1" "B.1.1.176" "B.1.1.398" "B.1.1.417" "B.1.1.63"
[321] "B.1.1.422" "B.1.1.487" "B.1.1.312" "B.1.1.331" "B.1.1.58"
[326] "B.1.1.277" "K.1" "B.1.1.30" "B.1.1.516" "B.1.1.50"
[331] "B.1.1.33" "N.2" "N.3" "N.7" "N.5"
[336] "N.8" "N.4" "B.1.1.409" "B.1.1.61" "B.1.1.220"
[341] "R.1" "B.1.1.317" "B.1.1.464.1" "B.1.1.316" "B.1.1.254"
[346] "B.1.1.267" "B.1.1.241" "B.1.1.25" "B.1.1.181" "B.1.1.434"
[351] "B.1.1.243" "B.1.1.159" "B.1.1.411" "B.1.1.232" "B.1.1.459"
[356] "B.1.1.111" "B.1.1.294" "B.1.1.373" "B.1.1.413" "B.1.1.354"
[361] "B.1.1.54" "B.1.1.325" "B.1.1.399" "B.1.1.46" "B.1.1.274"
[366] "B.1.1.410" "B.1.1.315" "AD.2" "B.1.1.142" "B.1.1.485"
[371] "B.1.1.200" "B.1.1.339" "AL.1" "B.1.1.163" "B.1.1.378"
[376] "B.1.1.39" "B.1.1.440" "B.1.1.297" "B.1.1.397" "B.1.1.207"
[381] "B.1.1.391" "B.1.1.348" "B.1.1.121" "B.1.1.512" "B.1.1.136"
[386] "B.1.1.288" "B.1.1.368" "B.1.1.328" "B.1.1.324" "B.1.1.360"
[391] "B.1.1.359" "B.1.1.374" "B.1.1.12" "B.1.1.407" "B.1.1.500"
[396] "B.1.1.432" "B.1.301" "B.1.1.37" "B.1.1.396" "B.1.1.330"
[401] "B.1.1.280" "B.1.1.141" "B.1.1.273" "B.1.1.514" "B.1.1.311"
[406] "B.1.1.377" "B.1.1.351" "B.1.1.350" "B.1.1.266" "B.1.1.225"
[411] "B.1.1.371" "B.1.1.438" "B.1.1.153" "B.1.1.134" "B.1.1.439"
[416] "B.1.1.152" "B.1.1.275" "B.1.406" "B.1.1.261" "P.1"
[421] "B.1.1.263" "B.1.1.349" "B.1.1.214" "B.1.326" "B.1.1.269"
[426] "D.2" "B.1.1.381" "B.1.1.448" "B.1.1.388" "B.1.1.157"
[431] "B.1.1.394" "B.1.1.218" "B.1.1.27" "B.1.1.70" "B.1.1.204"
[436] "B.1.1.301" "B.1.1.326" "B.1.1.120" "B.1.1.389" "B.1.1.401"
[441] "B.1.1.404" "B.1.1.416" "B.1.1.216" "AM.1" "AM.4"
[446] "B.1.1.34" "B.1.1.198" "B.1.1.319" "B.1.1.7" NA
B
A
A.6
A.2
A.2.4
A.2.5
A.5
A.1
A.21
A.18
A.12
A.28
A.19
A.25
A.23
A.23.1
B.12
B.56
B.4.7
B.31
B.61
B.40
B.35
B.28
B.11
B.55
B.27
B.3
B.4
B.4.8
B.4.2
B.53
B.6
B.1
B.6.6
B.6.1
B.6.2
B.6.8
B.1.256
B.1.1
B.1.1.420
B.1.240
B.1.205
B.1.128
B.1.221
B.1.223
B.1.214.2
B.1.214
B.1.146
B.1.147
B.1.104
B.1.391
B.1.380
B.1.203
B.1.540
B.1.416
B.1.600
B.1.160
B.1.195
B.1.609
B.1.561
B.1.232
B.1.558
B.1.239
B.1.237
B.1.597
B.1.480
B.1.397
B.1.408
B.1.243
B.1.240.2
B.1.240.1
B.1.524
B.1.179
B.1.1.298
B.1.400
B.1.398
B.1.236
B.1.234
B.1.459
B.1.131
B.1.525
B.1.563
B.1.411
B.1.379
B.1.177.31
B.1.177
B.1.177.73
B.1.177.11
B.1.177.14
B.1.177.15
AA.1
B.1.177.44
B.1.177.39
B.1.177.40
B.1.177.35
B.1.177.82
B.1.177.87
B.1.177.75
B.1.177.45
B.1.177.28
B.1.177.32
B.1.177.37
B.1.177.72
B.1.117.2
B.1.177.4
B.1.177.53
W.2
W.1
B.1.177.50
B.1.177.81
B.1.177.58
B.1.177.55
B.1.177.56
B.1.177.57
B.1.177.51
B.1.177.52
B.1.177.21
B.1.177.77
B.1.177.6
B.1.177.7
B.1.177.86
B.1.177.8
B.1.177.62
B.1.177.60
U.3
B.1.177.79
B.1.9.5
B.1.9
B.1.439
B.1.111
B.1.162
B.1.438
B.1.164
B.1.36
B.1.441
B.1.36.35
B.1.468
B.1.462
B.1.471
B.1.469
B.1.466
B.1.466.2
B.1.470
B.1.160.16
B.1.456
B.1.36.29
B.1.36.31
B.1.36.1
B.1.36.8
B.1.36.26
B.1.36.34
B.1.36.7
B.1.36.16
B.1.36.36
B.1.36.38
B.1.36.37
B.1.36.18
B.1.36.10
B.1.476
B.1.281
B.1.273
B.1.544
B.1.270
B.1.277
B.1.279
B.1.314
B.1.367
B.1.370
B.1.413
B.1.316
B.1.3
B.1.509
B.1.504
B.1.362
B.1.499
B.1.615
B.1.611
B.1.505
B.1.422
B.1.596.1
B.1.362.2
B.1.369
B.1.564
B.1.564.1
B.1.369.1
B.1.1.1
B.1.428
B.1.356
B.1.601
B.1.562
B.1.360
B.1.311
B.1.598
B.1.350
B.1.426
B.1.265
B.1.336
B.1.589
B.1.567
B.1.588.1
B.1.595
B.1.596
B.1.2
B.1.349
B.1.366
B.1.576
B.1.603
B.1.526.2
B.1.526
B.1.291
B.1.575
B.1.428.3
B.1.428.1
B.1.428.2
B.1.427
B.1.429
B.1.495
B.1.517
B.1.324
B.1.298
B.1.573
B.1.320
B.1.361
B.1.582
B.1.612
B.1.351
B.1.1.255
B.1.1.430
B.1.1.306
AE.7
AE.2
AE.5
AE.6
AE.4
AE.8
B.1.153
B.1.329
B.1.535
B.1.313
B.1.170
B.1.22.1
B.1.201
B.1.227
B.1.415
B.1.94
B.1.1.519
B.1.1.222
B.1.552
B.1.199
B.1.178
B.1.220
B.1.6
B.1.84
B.1.1.170
B.1.93
B.1.91
B.1.219
B.1.211
B.1.388
B.1.210
B.1.555
B.1.206
B.1.260
B.1.420
B.1.192
B.1.258
B.1.258.17
B.1.242
B.1.126
B.1.378
B.1.565
B.1.139
B.1.8
B.1.78
B.1.396
B.1.560
B.1.527
B.1.142
B.1.393
B.1.187
B.1.149
B.1.549
B.1.545
B.1.189
B.1.528
B.1.23
B.1.67
B.1.547
P.2
B.1.1.28
B.1.390
B.1.395
B.1.530
C.35
B.1.233
B.1.543
B.6.7
B.1.1.369
B.1.1.372
C.8
C.36
C.36.1
C.16
C.23
C.1.1
B.1.1.375
C.2.1
C.12
B.1.1.161
B.1.1.87
B.1.1.192
B.1.1.71
B.1.1.10
L.3
L.1
B.1.1.176
B.1.1.398
B.1.1.417
B.1.1.63
B.1.1.422
B.1.1.487
B.1.1.312
B.1.1.331
B.1.1.58
B.1.1.277
K.1
B.1.1.30
B.1.1.516
B.1.1.50
B.1.1.33
N.2
N.3
N.7
N.5
N.8
N.4
B.1.1.409
B.1.1.61
B.1.1.220
R.1
B.1.1.317
B.1.1.464.1
B.1.1.316
B.1.1.254
B.1.1.267
B.1.1.241
B.1.1.25
B.1.1.181
B.1.1.434
B.1.1.243
B.1.1.159
B.1.1.411
B.1.1.232
B.1.1.459
B.1.1.111
B.1.1.294
B.1.1.373
B.1.1.413
B.1.1.354
B.1.1.54
B.1.1.325
B.1.1.399
B.1.1.46
B.1.1.274
B.1.1.410
B.1.1.315
AD.2
B.1.1.142
B.1.1.485
B.1.1.200
B.1.1.339
AL.1
B.1.1.163
B.1.1.378
B.1.1.39
B.1.1.440
B.1.1.297
B.1.1.397
B.1.1.207
B.1.1.391
B.1.1.348
B.1.1.121
B.1.1.512
B.1.1.136
B.1.1.288
B.1.1.368
B.1.1.328
B.1.1.324
B.1.1.360
B.1.1.359
B.1.1.374
B.1.1.12
B.1.1.407
B.1.1.500
B.1.1.432
B.1.301
B.1.1.37
B.1.1.396
B.1.1.330
B.1.1.280
B.1.1.141
B.1.1.273
B.1.1.514
B.1.1.311
B.1.1.377
B.1.1.351
B.1.1.350
B.1.1.266
B.1.1.225
B.1.1.371
B.1.1.438
B.1.1.153
B.1.1.134
B.1.1.439
B.1.1.152
B.1.1.275
B.1.406
B.1.1.261
P.1
B.1.1.263
B.1.1.349
B.1.1.214
B.1.326
B.1.1.269
D.2
B.1.1.381
B.1.1.448
B.1.1.388
B.1.1.157
B.1.1.394
B.1.1.218
B.1.1.27
B.1.1.70
B.1.1.204
B.1.1.301
B.1.1.326
B.1.1.120
B.1.1.389
B.1.1.401
B.1.1.404
B.1.1.416
B.1.1.216
AM.1
AM.4
B.1.1.34
B.1.1.198
B.1.1.319
B.1.1.7
NA```
<!-- rnb-output-end -->
<!-- rnb-chunk-end -->
<!-- rnb-text-begin -->
------------------------------------------------------------------------
### 1.3.1 Pick the tip attributes that are most interesting or relevant
For our purposes it looks like we will want to explore our data with:
- GISAID clade: A **g**lobal **i**nitiative on **s**haring **a**vian **i**nfluenza **d**ata scientific consortium of clade naming.
- Nextstrain clade: Computationally labeled clade information based on the [Nextstrain](https://nextstrain.org/) criteria and analysis of new and continuing strains from COVID genomic data.
- Emerging Nextstrain clade.
- PANGO lineage: a dynamic nomenclature of [SARS-CoV-2 lineages](https://en.wikipedia.org/wiki/Phylogenetic_Assignment_of_Named_Global_Outbreak_Lineages).
- Country: The country where the sample was collected.
- Admin Division: geographic location of the particular strain/case.
- Age: the age of the patients from which the sample was collected (if included)
- Collection Data: the date the data was published or collected.
- Strain: the name of the strain which we'll need for merging with our tree structure.
Let's pull that information down and add it to our `tbl_tree` before converting it to a `treedata` object.
<!-- rnb-text-end -->
<!-- rnb-chunk-begin -->
<!-- rnb-source-begin eyJkYXRhIjpbIiMgQWRkIHBoeWxvZ2VuZXRpYyBpbmZvcm1hdGlvbiB0byBvdXIgdHJlZSIsIlNDMl92YXJpYW50c190aW1lLnRyZWUgPC1cblxuICAjIFBhc3MgYWxvbmcgdGhlIG1ldGFkYXRhXG4gIFNDMl9tZXRhZGF0YS5kZiAlPiUgXG4gIFxuICAjIFNlbGVjdCBqdXN0IGEgaGFuZGZ1bCBvZiBhdHRyaWJ1dGVzIHRvIGdvIHRvIHRoZSB0cmVlXG4gIHNlbGVjdChTdHJhaW4sICdHSVNBSUQgY2xhZGUnLCAnRW1lcmdpbmcgTmV4dHN0cmFpbiBjbGFkZScsIFxuICAgICAgICAgJ1BBTkdPIGxpbmVhZ2UnLCBDb3VudHJ5LCAnQWRtaW4gRGl2aXNpb24nLCAnQ29sbGVjdGlvbiBEYXRhJywgQWdlKSAlPiUgXG4gIFxuICByZW5hbWUoc3RyYWluID0gU3RyYWluLFxuICAgICAgICAgR0lTQUlEID0gJ0dJU0FJRCBjbGFkZScsIFxuICAgICAgICAgZW1lcmdpbmdfbmV4dHN0cmFpbiA9ICdFbWVyZ2luZyBOZXh0c3RyYWluIGNsYWRlJyxcbiAgICAgICAgIFBBTkdPID0gJ1BBTkdPIGxpbmVhZ2UnLFxuICAgICAgICAgc3RyYWluX2NvdW50cnkgPSBDb3VudHJ5LFxuICAgICAgICAgc3RyYWluX2RpdmlzaW9uID0gJ0FkbWluIERpdmlzaW9uJywgXG4gICAgICAgICBzdHJhaW5fZGF0ZSA9ICdDb2xsZWN0aW9uIERhdGEnKSAlPiUgXG4gIFxuICAjIGZ1bGxfam9pbiB0byBvdXIgdHJlZSB0byBlbnN1cmUgbm8gZGF0YSBpcyBsb3N0IGFsdGhvdWdoIHlvdSBjb3VsZCBhbHNvIGNhcmVmdWxseSB1c2UgYSBsZWZ0X2pvaW5cbiAgZnVsbF9qb2luKHg9U0MyX3ZhcmlhbnRzX3RpbWUudGIsIHk9LiwgYnk9YyhcImxhYmVsXCIgPSBcInN0cmFpblwiKSkgJT4lIFxuICBcbiAgIyBDb252ZXJ0IHRvIGEgdGlkeXRyZWUgZm9ybWF0XG4gIGFzLnRyZWVkYXRhKCkiLCIiLCIjIExvb2sgYXQgdGhlIHJlc3VsdGluZyBzdHJ1Y3R1cmUiLCJzdHIoU0MyX3ZhcmlhbnRzX3RpbWUudHJlZSkiXX0= -->
```r
# Add phylogenetic information to our tree
SC2_variants_time.tree <-
# Pass along the metadata
SC2_metadata.df %>%
# Select just a handful of attributes to go to the tree
select(Strain, 'GISAID clade', 'Emerging Nextstrain clade',
'PANGO lineage', Country, 'Admin Division', 'Collection Data', Age) %>%
rename(strain = Strain,
GISAID = 'GISAID clade',
emerging_nextstrain = 'Emerging Nextstrain clade',
PANGO = 'PANGO lineage',
strain_country = Country,
strain_division = 'Admin Division',
strain_date = 'Collection Data') %>%
# full_join to our tree to ensure no data is lost although you could also carefully use a left_join
full_join(x=SC2_variants_time.tb, y=., by=c("label" = "strain")) %>%
# Convert to a tidytree format
as.treedata()
# Look at the resulting structure
str(SC2_variants_time.tree)
Formal class 'treedata' [package "tidytree"] with 11 slots
..@ file : chr(0)
..@ treetext : chr(0)
..@ phylo :List of 5
.. ..$ edge : int [1:12776, 1:2] 6628 6631 6632 6633 6636 6635 6638 6638 6639 6640 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : chr [1:2] "parent" "node"
.. ..$ edge.length: num [1:12776] 0.0757 0 0.0077 0 0 ...
.. ..$ tip.label : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
.. ..$ node.label : chr [1:6150] "NODE_0008606" "NODE_0000001" "NODE_0000061" "USA/MA1/2020_travel_history" ...
.. ..$ Nnode : int 6150
.. ..- attr(*, "class")= chr "phylo"
..@ data : tibble [12,777 x 8] (S3: tbl_df/tbl/data.frame)
.. ..$ node : int [1:12777] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ GISAID : chr [1:12777] "L" "L" "L" "L" ...
.. ..$ emerging_nextstrain: chr [1:12777] "19A" "19A" "19A" "19A" ...
.. ..$ PANGO : chr [1:12777] "B" "B" "B" "B" ...
.. ..$ strain_country : chr [1:12777] "Asia" "USA" "Canada" "Canada" ...
.. ..$ strain_division : chr [1:12777] "Asia" "Massachusetts" "Ontario" "Ontario" ...
.. ..$ strain_date : Date[1:12777], format: "2019-12-26" "2020-01-29" ...
.. ..$ Age : num [1:12777] 44 21 NA NA 54 NA 47 47 23 55 ...
..@ extraInfo : tibble [0 x 0] (S3: tbl_df/tbl/data.frame)
Named list()
..@ tip_seq : NULL
..@ anc_seq : NULL
..@ seq_type : chr(0)
..@ tipseq_file: chr(0)
..@ ancseq_file: chr(0)
..@ info : list()
Take a quick look at the structure of this treedata object. What do
we see? The treedata class has converted our tibble back
into an S4 object with 11 slots. Each slot, is
essentially a placeholder for another object. We can access these slots
using the @ operator and we can further access sub elements
with the $ operator.
For instance, we can see there is a phylo slot which
looks suspiciously like our original
SC2_variants_time.phylo object. The rest of our data frame
information, which we just added to the tbl_tree before
converting it is stored in the data slot. There are
additional slots where we can add sequencing information for comparison
in different types of phylogenetic tree visualizations.
# Grab our phylo object
SC2_variants_time.tree@phylo
Phylogenetic tree with 6627 tips and 6150 internal nodes.
Tip labels:
Wuhan/WH01/2019, USA/MA1/2020, Canada/ON_ON-VIDO-01-2/2020, Canada/ON_VIDO-01/2020, USA/CA-CDC-5/2020, Taiwan/CGMH-CGU-02/2020, ...
Node labels:
NODE_0008606, NODE_0000001, NODE_0000061, USA/MA1/2020_travel_history, NODE_0000062, Canada/ON_VIDO-01/2020_travel_history, ...
Unrooted; includes branch lengths.
# Take a peek at our associated node data
head(SC2_variants_time.tree@data)
ggtree() from the
ggtree packageNow that we’ve put some extra data into our tree, we are ready to
plot it with the help of the ggtree() function from the
ggtree package. If you haven’t noticed yet,
treeio, tidytree and ggtree form
a suite of packages that we can use to import, alter, and visualize our
trees.
Parameters for ggtree() include:
tr: the phylo tree object
layout: the shape of the tree including:
rectangular, dendrogram, slanted, ellipse, fan, circular,
inward_circular, and radial.
mrsd: most recent sampling date used for setting a
time-based scale
as.Date: logical to specify if date will be in
calendar vs decimal-date format
branch.length: variable for scaling branch
length
root.position: the position of our root node
(default = 0)
You’ll notice that ggtree() appears to act very much
like the ggplot() command. We’ll be able to add layers to
the base visualization much like we do with other ggplot objects. We’ll
also use theme_tree2() which will automatically add an
x-axis scale to our tree.
We’ll start simple by just visualizing all 6,627 tips of our tree.
# Show the entire tree
SC2_variants_time.tree %>%
# 1. Data
ggtree(tr = .) +
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 20),
legend.position = "bottom") + # Move our legend to the bottom
# Provide some labels
labs(x="Time",
title = "Canada and US sequenced strain phylogeny") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)))
Error in ggtree(tr = .): could not find function "ggtree"
drop.tip()Looking at our visualization, there are a lot of tips to display and we’ve coloured the tree along the branches. Another thing we can see is that there is an “NA” value. This mostly represents the branches of the tree connecting nodes internally since they do not have a specific lineage associated from our data.
As it stands, there’s just way too much data here to look at and work
with. We can trim that data using the drop.tip() function
without worrying about how the tree is parsed. Internally, all that
pruning will take place within the function. To do this, we’ll generate
a list of tips we want to remove first.
# You can make a list of tips to drop when you're making your ggtree
# Generate a table of tree information to drop
to_drop <-
SC2_metadata.df %>%
# Filter for strains that are not from Canada and are from before 2020-11-01
filter(!(Country == "Canada" &
`Collection Data` >= as.Date("2020-11-01")
)) %>%
# Just keep the strain information
pull(Strain)
# Take a look at what we got back
str(to_drop)
chr [1:6199] "Wuhan/WH01/2019" "USA/MA1/2020" ...
# At the same time make a dataset to keep. You'll use this to update the information in the tree later
to_keep <- SC2_metadata.df %>% filter(!Strain %in% to_drop)
# Which tips are we keeping?
str(to_keep, give.attr = FALSE)
spc_tbl_ [428 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Strain : chr [1:428] "Canada/NS-NML-16223/2021" "Canada/NB-NML-3294/2021" "Canada/NS-NML-16211/2021" "Canada/NB-NML-5584/2021" ...
$ GISAID clade : chr [1:428] "S" "S" "S" "S" ...
$ Nextstrain clade : chr [1:428] "19B" "19B" "19B" "19B" ...
$ Age : num [1:428] 23 25 35 29 61 79 NA NA NA NA ...
$ Clade : chr [1:428] "19B" "19B" "19B" "19B" ...
$ Country : chr [1:428] "Canada" "Canada" "Canada" "Canada" ...
$ Admin Division : chr [1:428] "Nova Scotia" "New Brunswick" "Nova Scotia" "New Brunswick" ...
$ Admin Division of exposure: chr [1:428] "Nova Scotia" "New Brunswick" "Nova Scotia" "New Brunswick" ...
$ gisaid_epi_isl : chr [1:428] "EPI_ISL_1055741" "EPI_ISL_961659" "EPI_ISL_1055730" "EPI_ISL_961643" ...
$ Host : chr [1:428] "Human" "Human" "Human" "Human" ...
$ Originating Lab : chr [1:428] "QEII Health Sciences Centre" "Hôpital Georges L. Dumont" "QEII Health Sciences Centre" "Hôpital Georges L. Dumont" ...
$ PANGO lineage : chr [1:428] "A.2.5" "A.23.1" "A.23.1" "A.23.1" ...
$ Submission Date : chr [1:428] "Older" "Older" "Older" "Older" ...
$ Region : chr [1:428] "North America" "North America" "North America" "North America" ...
$ Region of exposure : chr [1:428] "North America" "North America" "North America" "North America" ...
$ Sex : chr [1:428] "Male" "Female" "Female" "Male" ...
$ strain : chr [1:428] "Canada/NS-NML-16223/2021" "Canada/NB-NML-3294/2021" "Canada/NS-NML-16211/2021" "Canada/NB-NML-5584/2021" ...
$ Emerging Nextstrain clade : chr [1:428] "19B" "19B" "19B" "19B" ...
$ Submitting Lab : chr [1:428] "National Microbiology Laboratory (NML)" "National Microbiology Laboratory (NML)" "National Microbiology Laboratory (NML)" "National Microbiology Laboratory (NML)" ...
$ url : logi [1:428] NA NA NA NA NA NA ...
$ Collection Data : Date[1:428], format: "2021-01-16" "2021-01-02" ...
$ Author : chr [1:428] "Anna Majer et al" "Anna Majer et al" "Anna Majer et al" "Anna Majer et al" ...
$ Country of exposure : chr [1:428] NA NA NA NA ...
$ Location : chr [1:428] NA NA NA NA ...
geom_*
layersThe ggtree package brings a number of
ggplot2-compatible geoms to our finger-tips. We’ll
spruce up our tree with some tip labels to begin with. We’ll accomplish
this with the geom_tiplab() layer.
| Component type | Layer type | Command | Description |
|---|---|---|---|
| Tree | Geom | geom_tree | tree structure layer, with multiple layout supported |
| Tree | Geom | geom_treescale | tree branch scale legend |
| Node | Geom | geom_nodepoint | annotate internal nodes with symbolic points |
| Node | Annotation | geom_range | bar layer to present uncertainty of evolutionary inference |
| Node | Annotation | geom_rootpoint | annotate root node with symbolic point |
| Branch | Annotation | geom_label2 | modified version of geom_label, with subsetting supported for labelling branches |
| Branch | Annotation | geom_segment2 | modified version of geom_segment, with subsetting supported |
| Taxa | Geom | geom_point2 | modified version of geom_point, with subsetting supported |
| Taxa | Geom | geom_tippoint | annotate external nodes with symbolic points |
| Taxa | Annotation | geom_taxalink | associate two related taxa by linking them with a curve |
| Taxa | Annotation | geom_text2 | modified version of geom_text, with subsetting supported |
| Taxa | Annotation | geom_tiplab | layer of tip labels |
| Taxa | Annotation | geom_tiplab2 | layer of tip labels for circular layout |
| Clade | Annotation | geom_balance | highlights the two direct descendant clades of an internal node |
| Clade | Annotation | geom_cladelabel | annotate a clade with bar and text label |
| Clade | Annotation | geom_hilight | highlight a clade with rectangle |
| Clade | Annotation | geom_strip | annotate associated taxa with bar and (optional) text label |
# Show our pruned tree
SC2_variants_time.tree %>%
# drop tips based on our list
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = .) + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 20),
legend.position = "bottom") + # Move our legend to the bottom
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2))) +
# 4. Geoms
### 1.6.0 Add tips labels to our tree tips
geom_tiplab()
Error in ggtree(tr = .): could not find function "ggtree"
In a tree this packed, you can’t really read the tip labels. The data that’s most helpful, however, is the colouring. You can replace your labels with points in 3 ways:
geom_point()
geom_nodepoint()
geom_tippoint()
If we use geom_point() we can colour/annotate all of the
nodes and tips but we already know that there isn’t any phylogenetic
information associated with our internal nodes. Likewise,
geom_nodepoint() will allow us to colour the nodes but
there isn’t any grouping information associated with that.
Therefore we’ll use geom_tippoint() to colour our tips
based on their Nextstrain clade information and we can alter the tip
shape to match the province where the strain was sequenced.
# Show entire tree
SC2_variants_time.tree %>%
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = .) + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 20),
legend.position = "bottom", # Move our legend to the bottom
legend.box = "vertical") + # Stack legends vertically
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)),
shape = guide_legend(title = "Strain\nlocation")
) +
# 3. Scaling
scale_shape_manual(values = c(15:20, 11)) +
# 4. Geoms
# Add tips labels to our tree tips
### 1.6.1 Change our tip shape by strain_division
geom_tippoint(aes(shape = strain_division), alpha = 0.7, size = 3) # Add tips only
Error in ggtree(tr = .): could not find function "ggtree"
mrsd parameterAs you can see we are working with a time-based axis but it all
appears to be on a relative scale and what we’d really like to see is
real-world time. In order to do that, we can assign the
mrsd parameter to the most recent sampling date in our
dataset. We’ll also set our parameter as.Date in order to
see the date in a calendar format rather than a decimal-date format.
# Show entire tree
SC2_variants_time.tree %>%
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = .,
mrsd = "2021-02-17", ### 1.6.2 Set the most recent sampling date
as.Date = TRUE) + ### 1.6.2 Make sure it's set as a data
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 20),
legend.position = "bottom", # Move our legend to the bottom
legend.box = "vertical") + # Stack legends vertically
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)),
shape = guide_legend(title = "Strain\nlocation")
) +
# 3. Scaling
scale_shape_manual(values = c(15:20, 11)) +
# 4. Geoms
# Add tips labels to our tree tips
# Change our tip shape by strain_division
geom_tippoint(aes(shape = strain_division), alpha = 0.7, size = 3) # Add tips only
Error in ggtree(tr = ., mrsd = "2021-02-17", as.Date = TRUE): could not find function "ggtree"
viewClade() and
other information functionsLooking at the tree there are still quite a few nodes but we can zoom
in on a specific clade using the viewClade() function. To
accomplish this we need to access the most recent common ancestor (MRCA)
of a group. At the same time, we should talk about ways to traverse or
navigate this tree. Here’s where tidytree offers up a few
additional functions for searching your tbl_tree. These
take on the form of function(tbl_tree, node_number) or
function(tbl_tree, node_label)
| function | Description |
|---|---|
| child() | Find the children of an internal node. |
| parent() | Find the parent node of a tip or other node. |
| offspring() | Find all of the offspring of a node. |
| ancestor() | Find all of the ancestors of a tip or node |
| sibling() | Find the nearest sibling of a tip (or node) |
| MRCA() | Find the most recent common ancestor between two or more nodes |
Let’s filter our taxa list a little more searching for variants of the clade 501Y.V1 and 501Y.V2. Then we’ll find the MRCA between those tips.
# You can make a list of tips to drop when you're making your ggtree
# Generate a table of tree information to drop
to_view <-
SC2_metadata.df %>%
# Filter our tip information
filter(`Emerging Nextstrain clade` %in% c("20I/501Y.V1", "20H/501Y.V2") &
Country == "Canada" &
`Collection Data` >= as.Date("2021-02-16")
) %>%
# Just keep the strain names
pull(Strain)
# What's the list look like?
to_view
[1] "Canada/NL-NML-17170/2021" "Canada/NL-NML-17182/2021"
[3] "Canada/NL-NML-17173/2021" "Canada/NL-NML-17193/2021"
[5] "Canada/NL-NML-17157/2021" "Canada/NL-NML-17185/2021"
[7] "Canada/NL-NML-17189/2021" "Canada/NL-NML-17196/2021"
[9] "Canada/NL-NML-17184/2021" "Canada/NL-NML-17166/2021"
[11] "Canada/NL-NML-17191/2021" "Canada/NL-NML-17172/2021"
[13] "Canada/NL-NML-17187/2021" "Canada/NL-NML-17168/2021"
[15] "Canada/NL-NML-17194/2021" "Canada/NL-NML-17158/2021"
[17] "Canada/NL-NML-17183/2021" "Canada/NL-NML-17165/2021"
Canada/NL-NML-17170/2021
Canada/NL-NML-17182/2021
Canada/NL-NML-17173/2021
Canada/NL-NML-17193/2021
Canada/NL-NML-17157/2021
Canada/NL-NML-17185/2021
Canada/NL-NML-17189/2021
Canada/NL-NML-17196/2021
Canada/NL-NML-17184/2021
Canada/NL-NML-17166/2021
Canada/NL-NML-17191/2021
Canada/NL-NML-17172/2021
Canada/NL-NML-17187/2021
Canada/NL-NML-17168/2021
Canada/NL-NML-17194/2021
Canada/NL-NML-17158/2021
Canada/NL-NML-17183/2021
Canada/NL-NML-17165/2021
# Calculate the MRCA from a subset of our tips
to_view_MRCA <-
# Pass along our tree information
SC2_variants_time.tree %>%
# Get rid of the tips we would before (to match the tree we'll plot)
drop.tip(., to_drop) %>%
# Determine the MRCA between the first 4 tips in the list
MRCA(., to_view[1:4])
[33m![39m # Invaild edge matrix for [34m[34m<phylo>[34m[39m. A [34m[34m<tbl_df>[34m[39m is returned.
# What kind of value does MRCA() return?
to_view_MRCA
[1] 727
# How many offspring does it have?
# Pass along our tree information
SC2_variants_time.tree %>%
# Drop our original set of tips
drop.tip(., to_drop) %>%
# make a tibble so we can see all the relevant information
as_tibble() %>%
# Find the offspring of a specific node
offspring(., to_view_MRCA) %>%
# Convert to a tree
as.treedata() %>%
str()
[33m![39m # Invaild edge matrix for [34m[34m<phylo>[34m[39m. A [34m[34m<tbl_df>[34m[39m is returned.
Formal class 'treedata' [package "tidytree"] with 11 slots
..@ file : chr(0)
..@ treetext : chr(0)
..@ phylo :List of 5
.. ..$ edge : int [1:71, 1:2] 728 730 731 732 732 734 735 735 736 737 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : chr [1:2] "parent" "node"
.. ..$ edge.length: num [1:71] 0.1136 0.067 0.0521 0.0306 0.0333 ...
.. ..$ tip.label : chr [1:39] "Canada/NL-NML-16819/2021" "Canada/NL-NML-16824/2021" "Canada/NL-NML-17178/2021" "Canada/NL-NML-16829/2021" ...
.. ..$ node.label : chr [1:32] "NODE_0006538" "NODE_0006539" "NODE_0006540" "NODE_0006541" ...
.. ..$ Nnode : int 32
.. ..- attr(*, "class")= chr "phylo"
..@ data : tibble [71 x 8] (S3: tbl_df/tbl/data.frame)
.. ..$ node : int [1:71] 323 324 325 326 327 328 329 330 331 332 ...
.. ..$ GISAID : chr [1:71] "GRY" "GRY" "GRY" "GRY" ...
.. ..$ emerging_nextstrain: chr [1:71] "20I/501Y.V1" "20I/501Y.V1" "20I/501Y.V1" "20I/501Y.V1" ...
.. ..$ PANGO : chr [1:71] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" ...
.. ..$ strain_country : chr [1:71] "Canada" "Canada" "Canada" "Canada" ...
.. ..$ strain_division : chr [1:71] "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" ...
.. ..$ strain_date : Date[1:71], format: "2021-02-10" "2021-02-12" ...
.. ..$ Age : num [1:71] 47 16 26 62 60 45 19 45 35 31 ...
..@ extraInfo : tibble [0 x 0] (S3: tbl_df/tbl/data.frame)
Named list()
..@ tip_seq : NULL
..@ anc_seq : NULL
..@ seq_type : chr(0)
..@ tipseq_file: chr(0)
..@ ancseq_file: chr(0)
..@ info : list()
ggplot2 object to
viewClade()We see that our chosen MRCA node (727) has 39 children based on the
number of tip labels in our treedata object. We can also
now use this MRCA to determine the specific clade we can see on the tree
but this will still be quite large. To view this particular clade, we
build our plot as before, and then zoom into it afterwards with the
viewClade() function.
Unfortunately, we’ll have to ditch our date scale and revert to a decimal-date. If you’re not using a calendar-based date, it’s not a problem at all. If we look carefully at the data, we’ll see that the internal nodes have an NA-value. If you were industrious enough, you could use the branch lengths to traverse the tree and calculate dates for all the internal nodes as well. This would likely solve the issue.
Let’s drop the tip names back in there too.
# Show a small clade of the tree
tree.plot <-
SC2_variants_time.tree %>%
# From our tree, drop the Non-Canadian tips
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = ., mrsd = "2021-02-16") + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 30),
legend.position = "bottom", # Move our legend to the bottom
legend.box = "vertical") + # Stack legends vertically
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)),
shape = guide_legend(title = "Strain\nlocation")
) +
# 3. Scaling
scale_shape_manual(values = c(15:20, 11)) +
# 4. Geoms
geom_tiplab(size = 7, align=TRUE) + # Add labels in and right-align them
# Add tips labels to our tree tips
# Change our tip shape by strain_division
geom_tippoint(aes(shape = strain_division), size = 3) # Add tips only
Error in ggtree(tr = ., mrsd = "2021-02-16"): could not find function "ggtree"
### 1.7.1 View the clade we made using the same MRCA call we already used.
viewClade(tree.plot, MRCA(tree.plot, to_view[1:5]))
Error in viewClade(tree.plot, MRCA(tree.plot, to_view[1:5])): could not find function "viewClade"
So we’ve looked at a few ways to subset and plot our phylogenetic
tree. We’ll do a quick aside to create a more curated list of tips with
representation from across multiple clades. We’ll use this as the base
tree for our next few graphs so that we can really see the power of
using additional information from our treedata object.
# Here's our representative list of tips from multiple clades
curated_tips <-
c('Wuhan/WH01/2019',
'Canada/ON_ON-VIDO-01-2/2020',
'Canada/ON_VIDO-01/2020',
'Canada/BC_37_0-2/2020',
'Canada/BC_69243/2020',
'Canada/QC-CHUM-2008003956A/2020',
'Canada/NL-NML-1387/2020',
'Canada/MB-NML-808/2020',
'Canada/NB-NML-16967/2021',
'Canada/BC_6129127/2020',
'Canada/AB-97776/2020',
'Canada/AB-12948/2020',
'Canada/BC-BCCDC-3564/2020',
'Canada/ON-S67/2020',
'Canada/AB-65233/2020',
'Canada/MB-NML-1057/2020',
'Canada/ON-UHTC-0366/2020',
'Canada/NS-NML-16218/2021',
'Canada/NB-NML-16966/2021',
'Canada/NS_13/2020',
'Canada/Qc-CHUM-2019203856A/2020',
'Canada/NB-NML-3272/2020',
'Canada/ON-UHTC-0267/2020',
'Canada/MB-NML-1037/2020',
'Canada/NS-NML-5213/2020',
'Canada/NS-NML-5141/2021',
'Canada/NS-NML-5140/2021',
'Canada/NS-NML-16208/2021',
'Canada/BC-BCCDC-7012/2020',
'Canada/ON-LTRI-1372/2020',
'Canada/BC-BCCDC-9736/2021',
'Canada/ON-S2125/2021',
'Canada/NS-NML-5181/2020',
'Canada/ON-PPS-21012021_0269/2021',
'Canada/QC-L00314539/2020',
'Canada/NS-NML-16238/2021',
'Canada/NL-NML-16822/2021',
'Canada/QC-L00324589001/2021',
'Canada/NL-NML-17194/2021')
# Make a new list of tips to drop
curated_drop_list <-
SC2_metadata.df %>%
# filter the opposite of our tip labels
filter(!(Strain %in% curated_tips)) %>%
# Only keep the strain information for what we want to drop
pull(Strain)
# Drop tips from our tree and save it as curated_tree_info.df
curated_tree_info.df <-
# Pass along our original tree
SC2_variants_time.tree %>%
# Drop the tips we created
drop.tip(., ...) %>%
# Convert to a tibble
as_tibble() %>%
# Rename our column information
rename(Clade = ..., Province = ...) %>%
mutate(Age = replace_na(Age, replace = 0)) %>%
# Convert this to a data frame
as.data.frame()
Error in mutate(., Age = replace_na(Age, replace = 0)): '...' used in an incorrect context
# Set the rownames from the label information
rownames(curated_tree_info.df) <- curated_tree_info.df$label
Error in eval(expr, envir, enclos): object 'curated_tree_info.df' not found
# Let's view the tree
str(curated_tree_info.df)
Error in str(curated_tree_info.df): object 'curated_tree_info.df' not found
layout
parameterSo we’ve only been working with a rectangular tree (horizontal layout) thus far but as we mentioned at the beginning there are actually a number of different layouts we can use. Our goal now is to make a sort of sunburst plot using a circular layout that we’ll add visual categorical information to in a ring pattern.
First, we’ll start with the circular version. We’ll colour our tips by the province where the case or data was generated and label them by the Nextstrain clade information.
# Show our curated tree in a circular format
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = ..., ### 1.9.0 Set the layout to circular
mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 20),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
scale_colour_viridis_d(option = "plasma") + # This seems to literally kill the kernel!
# 4. Geoms
geom_tippoint(aes(colour = strain_division), size = 5) +
### 1.9.0 set the tip labels
geom_tiplab(aes(label = ...), size = 8)
Error in ggtree(., layout = ..., mrsd = "2021-02-17"): could not find function "ggtree"
# View our tree
tree.plot
Error in eval(expr, envir, enclos): object 'tree.plot' not found
gheatmap() layer to add information to your
plotNow that we have our basic circular tree, we can use the information
from curated_tree_info.df to visualize additional data on
our tree. In order to use the gheatmap layer properly, we
must pass it our original tree plot, along with a new dataframe (or
vector) that holds the information we want to add. It should use the
rownames from the data frame to help match up to the tips in our
tree.
We’ll drop the tip labels from our tree in favour of the heatmap we’ll add. By adding layers of hierarchical data, this creates what is known as a Sunburst plot.
The gheatmap() layer takes some of the following parameters:
p: the tree plot we want to modify.
data: the data used to modify the tree plot
p.
offset: determines the offset of the heatmap
relative to the tree.
width: determines the total width of the heatmaps
compared to the tree
colnames_angle: determines the angle of the text for
the heatmap.
font.size: sets the font size for the heatmap
portion of the tree.
# Show our curated tree in a circular format
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 20),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
scale_colour_viridis_d(option = "plasma") +
# 4. Geoms
geom_tippoint(aes(colour = strain_division), size = 5) # Add tips only
Error in ggtree(., layout = "circular", mrsd = "2021-02-17"): could not find function "ggtree"
### 1.10.0 Add a circular heatmap with gheatmap()
tree.plot <- ...(tree.plot, ...,
offset = 0.1, width = 0.1,
colnames_angle = 90, font.size = 6, hjust = 1) +
scale_fill_continuous(name = "Tree coding")
Error in ...(tree.plot, ..., offset = 0.1, width = 0.1, colnames_angle = 90, : could not find function "..."
tree.plot
Error in eval(expr, envir, enclos): object 'tree.plot' not found
ggnewscale package allows multiple color/fill
scales on your Sunburst plotFrom our sunburst plot above we used multiple columns from
curated_tree_info.df but the caveat is that the legend for
each resulting column combined all of the data into a single new legend.
That’s helpful if the type of data could be continuous values
like a [0,1] scale heatmap colour across different features. When
dealing with multiple categories where the scale of the values varies,
however, that doesn’t work well for us. This also doesn’t work well for
categorical data.
Instead, we’d like to separate the colour/fill scales by repeatedly
adding to our plot, layer by layer. As you might recall, however, we
also run into the problem of scale_colour_* and
scale_fill_* layers overwriting any previous
layers.
To circumvent this reality, we’ll use the ggnewscale
package to generate new colour and fill scales with
new_scale_fill() and new_scale_colour(). It
can make the process slightly more encumbering but it will work out.
We’ll add back in our original tip labels as well for this graph.
# Show our curated tree in a circular format
# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))
# Save our first plot with just the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny") +
# We'll want to ensure our colour guide ends up in the correct order
scale_colour_discrete(guide = guide_legend(title="Province", order=1)) +
# 4. Geoms
geom_tippoint(aes(colour = strain_division), size = 5) + # Add tips only
geom_tiplab(size = 10, align=TRUE, offset = 0.6) # Add in our labels
Error in ggtree(., layout = "circular", mrsd = "2021-02-17"): could not find function "ggtree"
# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length),
offset = 0.1, width = 0.1,
colnames_angle = 90, font.size = 6) +
# Set the name of our legend
scale_fill_continuous(name = "Tree coding",
guide = guide_legend(order = 2))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, : could not find function "gheatmap"
### 1.10.1 use this code to create a new colour scale
tree.plot <- tree.plot + ...
Error in eval(expr, envir, enclos): object 'tree.plot' not found
# Generate a categorical heatmap layer for the Clade variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(Clade),
offset = 0.2, width = 0.1,
colnames_angle = 90, font.size = 12) +
# Set the name and order of our Clade legend
scale_fill_discrete(name = "Nextstrain\nClade",
guide = guide_legend(order=3))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(Clade), offset = 0.2, : could not find function "gheatmap"
### 1.10.1 use this code to create a new colour scale
tree.plot <- tree.plot + ...
Error in eval(expr, envir, enclos): object 'tree.plot' not found
# Generate a categorical heatmap layer for the PANGO variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(...),
offset = 0.35, width = 0.1,
colnames_angle = 90, font.size = 12) +
# Set the name and order of our PANGO legend
scale_fill_manual(values = combo.colours, name = "PANGO\nlineage",
guide=guide_legend(order=4))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(...), offset = 0.35, : could not find function "gheatmap"
# You may need to re-render to see it properly (Expand/collapse again)
tree.plot
Error in eval(expr, envir, enclos): object 'tree.plot' not found
Depending on the level of data depth, you may also choose to layer
your scales along the right-hand side of your tree. To accomplish this
we can simply set our layout parameter to “rectangular”. At
the same time, we’ll run into a couple of problems with how the
ggtree is made.
One issue with ggplot is that as we add the scales, these are very
much like annotations on our ggplot. The vertical (and horizontal)
display areas of our plot are not updated as we add these annotations.
When we add new colour scale layers to the plot, the labels will, in
many cases, bleed outside of the designated plot area. To combat this,
we can add a layer called vexpand() or
hexpand() for horizontal expansion. Both of these take 2
parameters:
ratio: the ratio of expansion for your plot axisdirection: the direction you want the expansion in a
range from 1 (left/top) to -1 (right/bottom).# Show our curated tree in a rectangular format
# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))
# Save our first plot with just the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = ..., ### 1.10.2 convert your plot to a rectangular format
mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom",
) +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
### 1.10.2 Expand the vertical plot size
vexpand(ratio = ..., direction = ...) +
# We'll want to ensure our colour guide ends up in the correct order
scale_colour_discrete(guide = guide_legend(title="Province", order=1)) +
# 4. Geoms
geom_tippoint(aes(colour = strain_division), size = 5) + # Add tips only
geom_tiplab(size = 10, align=TRUE) # Add in our labels
Error in ggtree(., layout = ..., mrsd = "2021-02-17"): could not find function "ggtree"
# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length),
offset = 0.75, width = 0.1,
colnames_angle = 90, font.size = 8, hjust = 1) +
# Set the name of our legend
scale_fill_continuous(name = "Tree coding",
guide = guide_legend(order = 2))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, : could not find function "gheatmap"
# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill()
Error in eval(expr, envir, enclos): object 'tree.plot' not found
# Generate a categorical heatmap layer for the Clade variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(Clade),
offset = 0.85, width = 0.1,
colnames_angle = 90, font.size = 12, hjust = 1) +
# Set the name and order of our Clade legend
scale_fill_discrete(name = "Nextstrain\nClade",
guide = guide_legend(order=3))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(Clade), offset = 0.85, : could not find function "gheatmap"
# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill()
Error in eval(expr, envir, enclos): object 'tree.plot' not found
# Generate a categorical heatmap layer for the PANGO variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO),
offset = 1.0, width = 0.1,
colnames_angle = 90, font.size = 12, hjust = 1) +
# Set the name and order of our PANGO legend
scale_fill_manual(values = combo.colours, name = "PANGO\nlineage",
guide=guide_legend(order=4))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO), offset = 1, : could not find function "gheatmap"
# View the tree
tree.plot
Error in eval(expr, envir, enclos): object 'tree.plot' not found
geom_taxalink()Suppose we want to join to tips/taxa together to suggest that there
is a non-tree relationship between them or to emphasize some association
between nodes. To connect nodes or tips of the tree, we can use the
geom_taxalink() layer.
Let’s join the nodes Canada/NL-NML-17194/2021 and
Canada/ON-UHTC-0366/2020 together.
# Show our curated tree in a rectangular format
# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))
# Save our first plot with just the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "rectangular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom",
) +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
vexpand(ratio = 0.1, direction = -1) + # Expand the vertical plot size
# We'll want to ensure our colour guide ends up in the correct order
scale_colour_discrete(guide = guide_legend(title="Province", order=1)) +
# 4. Geoms
geom_tippoint(aes(colour = strain_division), size = 5) + # Add tips only
geom_tiplab(size = 10, align=TRUE) + # Add in our labels
### 1.11.0 Add in some annotation between points
geom_taxalink(taxa1=..., taxa2=...,
color="blue", alpha=0.8, offset=0,
outward=FALSE,
arrow=arrow(length=unit(0.01, "npc")))
Error in ggtree(., layout = "rectangular", mrsd = "2021-02-17"): could not find function "ggtree"
# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, branch.length),
offset = 0.75, width = 0.1,
colnames_angle = 90, font.size = 8, hjust = 1) +
# Set the name of our legend
scale_fill_continuous(name = "Tree coding",
guide = guide_legend(order = 2))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(parent, node, : could not find function "gheatmap"
# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill()
Error in eval(expr, envir, enclos): object 'tree.plot' not found
# Generate a categorical heatmap layer for the Clade variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(Clade),
offset = 0.85, width = 0.1,
colnames_angle = 0, font.size = 10) +
# Set the name and order of our Clade legend
scale_fill_discrete(name = "Nextstrain\nClade",
guide = guide_legend(order=3))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(Clade), offset = 0.85, : could not find function "gheatmap"
# use this code to create a new colour scale
tree.plot <- tree.plot + new_scale_fill()
Error in eval(expr, envir, enclos): object 'tree.plot' not found
# Generate a categorical heatmap layer for the PANGO variable
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO),
offset = 1.0, width = 0.1,
colnames_angle = 0, font.size = 10) +
# Set the name and order of our PANGO legend
scale_fill_manual(values = combo.colours, name = "PANGO\nlineage",
guide=guide_legend(order=4))
Error in gheatmap(tree.plot, curated_tree_info.df %>% select(PANGO), offset = 1, : could not find function "gheatmap"
# View the tree
tree.plot
Error in eval(expr, envir, enclos): object 'tree.plot' not found
ggtree packageSo we’ve spent quite a bit of time looking at phylogenetic trees and
how to add external data to tips and nodes. We’ve barely scratched the
surface and there are a lot of additional functions within the
ggtree package that you can work with to build amazing
plots including adding fasta sequence data through
msaplot(). You can experiment with labeling internal nodes,
and specific clades as well. You can also facet_plot()
rectangular trees with other kinds of plots!
We didn’t even have time to combine all sorts of plot data with our
circular trees using the ggtreeExtra package. Definitely
check out Chapter 10
of the tree book to be inspired by the potential things you could do
with more complex datasets like this:
The geom_fruit() layer can be used to add extra visualization to your sunburst plot through the ggtreeExtra package
We’ve seen a lot of trees today but a closely related structure is the network diagram. They share similar concepts in that both have nodes and branches. However, nodes are called vertices and can represent almost anything, branches between nodes are edges and can span across multiple nodes, instead of in a bifurcating tree relationship. Relationships between vertices can be bidirectional, and depending on what you’d like to present, multiple parallel edges may exist. We saw a specific version of network diagrams in Lecture 04 with our Sankey plots!
Network graphs are great for showing the interconnections between entities within your data. This kind of visualization can also help us realize where subtle connections occur (or don’t occur!). Depending on how you’ve weighted or chosen edges, you can convey how strong relationships between entities are as well.
To work with graph data and plot it, we’ll be using a couple of
companion packages: tidygraph and ggraph. More
about that later!
Since we already have some metadata about COVID genomes, let’s see if we can’t convert some of it to a network diagram to better understand strain information? We’ll begin by selecting some information from our Nextstrain metadata set.
str(SC2_metadata.df, give.attr = FALSE)
spc_tbl_ [6,627 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
$ GISAID clade : chr [1:6627] "L" "L" "L" "L" ...
$ Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
$ Age : num [1:6627] 44 21 NA NA 54 NA 47 47 23 55 ...
$ Clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
$ Country : chr [1:6627] "Asia" "USA" "Canada" "Canada" ...
$ Admin Division : chr [1:6627] "Asia" "Massachusetts" "Ontario" "Ontario" ...
$ Admin Division of exposure: chr [1:6627] "Asia" "Asia" "Ontario" "Asia" ...
$ gisaid_epi_isl : chr [1:6627] "EPI_ISL_406798" "EPI_ISL_409067" "EPI_ISL_425177" "EPI_ISL_413015" ...
$ Host : chr [1:6627] "Human" "Human" "Human" "Human" ...
$ Originating Lab : chr [1:6627] "General Hospital of Central Theater Command of People's Liberation Army of China" "Massachusetts Department of Public Health" "Public Health Ontario" "Public Health Ontario Laboratory" ...
$ PANGO lineage : chr [1:6627] "B" "B" "B" "B" ...
$ Submission Date : chr [1:6627] "Older" "Older" "Older" "Older" ...
$ Region : chr [1:6627] "Asia" "North America" "North America" "North America" ...
$ Region of exposure : chr [1:6627] "Asia" "Asia" "North America" "Asia" ...
$ Sex : chr [1:6627] "Male" "Male" "Male" "Male" ...
$ strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
$ Emerging Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ...
$ Submitting Lab : chr [1:6627] "BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Aca"| __truncated__ "Pathogen Discovery, Respiratory Viruses Branch, Division of Viral Diseases, Centers for Disease Control and Prevention" "Public Health Agency of Canada - National Microbiology Laboratory" "National Microbiology Laboratory" ...
$ url : logi [1:6627] NA NA NA NA NA NA ...
$ Collection Data : Date[1:6627], format: "2019-12-26" "2020-01-29" ...
$ Author : chr [1:6627] "Weijun Chen et al (https://dx.doi.org/10.1016/S0140-6736(20)30251-8)" "Clinton R. Paden et al (https://dx.doi.org/10.1101/2020.03.09.20032896)" "Amrit S. Boese et al" "Shari Tyson et al" ...
$ Country of exposure : chr [1:6627] NA "Asia" NA "Asia" ...
$ Location : chr [1:6627] NA "Boston" NA NA ...
Before we jump into making some network diagrams, we’ll want to trim
down the data from SC2_metadata.df. We don’t need all of
the data that is currently in this table. Instead we’ll trim it down to
10 variables:
Strain: the strain name of the genome. This matches the
tips of our original trees we made in section 1.0.0.Country: the country where the case was reported.Admin Division: the sub-region where the case was
identifiedAdmin Division of exposure: the sub-region believed to
be the source of the strain.Region: appears to be the continent where the case was
identified.Region of exposure: the continent was the case was
contracted.Nextstrain clade: the Nextstrain clade for this
strainGISAID clade: the clade classifiction defined by the
Global Initiative on Sharing Avian Influenza DataPANGO lineage: the Phylogenetic Assignment of Named
Global Outbreak lineageCollection Data: the collection date of the strain
(also misspelled)SC2_graph_info.df <-
# Pass along our metadata
SC2_metadata.df %>%
# We'll grab information about each case that relates to its geographical region
...(Strain, Country, 'Admin Division', 'Admin Division of exposure', 'Region of exposure', Region,
'Nextstrain clade', 'GISAID clade', 'PANGO lineage', 'Collection Data') %>%
# Do a bunch of renaming for our variables
...(source_location = 'Admin Division of exposure',
case_location = 'Admin Division',
source_region = 'Region of exposure',
case_region = 'Region',
collection_date = 'Collection Data',
Nextstrain = 'Nextstrain clade',
GISAID = 'GISAID clade',
PANGO = 'PANGO lineage'
)
Error in ...(., source_location = "Admin Division of exposure", case_location = "Admin Division", : could not find function "..."
head(SC2_graph_info.df)
Error in head(SC2_graph_info.df): object 'SC2_graph_info.df' not found
tidygraph package to help prepare
dataNow that we have some information that we want to work with, we want
to convert that kind of data to something that can be interpreted into a
graph. The tidygraph package provides a way to hook graph
data into the tidyverse so that we can use common verbs and
ideas to filter and work with it. Using this package we can convert our
dataframe information into a tbl_graph object which is
actually an igraph object.
The function we’ll use to convert our data is
as_tbl_graph() but it has some expectations about the data.
The parameters of this function are:
x: the data frame we’d like to convert to an
igraph.
nodes: a data frame with our node
information.
edges: a data frame of two columns containing
integers matching node information to describe the relationship between
nodes.
If you have both a nodes and edges data frame you can certainly
generate a table this way. However, we have a complex dataframe and we
want to track all of the information in it. To that end we will
add columns from and to based on
variables that already exist and the graph nodes will be generated from
this information. When we provide the data frame, the function will
recognize the columns present and produce node-based data and generate
edge characteristics from our other columns.
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Create our "from" and "to" columns in our data frame
mutate(from = source_location,
to = Country
) %>%
# Filter for just strain data from Canada and Asia
filter(Country %in% c(...)) %>%
# Convert to an igraph
as_tbl_graph() %>%
# Look at the structure of the igraph
str()
Error in as_tbl_graph(.): could not find function "as_tbl_graph"
ggraph()The ggraph() function is one of many from a package of
the same name. This package adds geoms and architecture that is
compatible with ggplot2 (for the most part). Some of the
functions we are interested in are:
| Component | Function | Description |
|---|---|---|
| Graph | ggraph | The equivalent of the call to ggplot, it
sets up the basic information about the graph plot |
| Edge | geom_edge_link | Produces edges between points |
| Edge | geom_edge_fan | Produces edges between points but accounts for parallel edges, creating arcs that fan out |
| Edge | geom_edge_parallel | Produces multiple edges between points with parallel lines representing parallel edges |
| Edge | geom_edge_loop | Used to represent edges that start and end at the same node. Does not account for parallel edges |
| Node | geom_node_point | Add basic nodes to your graph |
| Node | geom_node_circle | Add nodes to your graph that can be scaled by the coordinate system |
The ggraph() function takes in 3 parameters:
graph: the igraph object that we’ll
base the graph on.
layout: the type of layout for the graph. There are
many options including: auto, dendrogram, linear, matrix, treemap (yep!)
circlepack, partition, and hive.
circular: a logical stating whether the layout
should be transformed into a radial representation. It can’t be applied
to all layouts (think polar_coord).
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Generate our from and to edge information
mutate(from = source_region,
to = case_region
) %>%
# Convert to an igraph and pass it on to be plotted
as_tbl_graph() %>%
# 1. Data
ggraph(.) +
# 2. Aesthetics
theme(text = element_text(size = 10)) +
### 2.2.0 Add our edges
...(aes(colour = Nextstrain), arrow=arrow(), width = 1) +
### 2.2.0 Add our nodes and label them
...(size = 7) +
...(aes(label = name), size = 5, repel = TRUE)
Error in ggraph(.): could not find function "ggraph"
geom_edge_fan() to visualize parallel
edgesSo from our graph we can see some characteristics about how our case data is connected. Although most of the data is centred around genomes sequenced in North America, we can see that “North America” is also a source region.
The way we see the edges, however also has all of them overlaying on
top of each other. We know that this isn’t going to be a 1:1
relationship and although we have coloured the edges, we only see the
topmost edge in a stack of many. For a graph like this, if we
want to see all of the edges, we can use the
geom_edge_fan() layer which will help us see all of our
parallel edges in all of their splendour. Of note, we can also use
geom_edge_parallel() as a layer. While this layer may make
the number of connections clearer, it can get quite crowded.
Let’s try it out.
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Generate our from and to edge information
mutate(from = source_region,
to = case_region
) %>%
# Convert to an igraph
as_tbl_graph() %>%
# 1. Data
ggraph(.) +
# 2. Aesthetics
theme(text = element_text(size = 10)) +
### 2.2.1 Add our edges
...(aes(colour = ...), arrow=arrow()) +
# Add our nodes and label them
geom_node_point(size = 7) +
geom_node_text(aes(label = name), size = 5, repel = TRUE)
Error in ggraph(.): could not find function "ggraph"
The metadata we’ve chosen isn’t very detailed and that can be the
case for many datasets so you’ll want to be sure of how you pick your
nodes. Right now we are picking our from and
to values based on the supposed source and identifying
continents of the strain.
While other variables could offer more information, they can vary in
consistency from a region (Asia) to a province (Ontario) as seen in
variable case_location. Definitely aim to be consistent
when you’re working with your data otherwise the nodes may have less
meaning.
Let’s try looking at source_region (Continent) to
case_location (Divisions) as it relates to data from Canada
and Asia
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Generate our from and to edge information
mutate(from = ...,
to = ...
) %>%
# Filter for Canada data
filter(Country %in% c("Canada", "Asia")) %>%
# Convert to an igraph
as_tbl_graph() %>%
# 1. Data
ggraph(.) +
# 2. Aesthetics
theme(text = element_text(size = 20)) +
# Add our edges
geom_edge_fan(aes(colour = Nextstrain), arrow=arrow()) +
### 2.2.2 If we suspect looping edges, we need to define that layer specifically
... +
# Add our nodes and label them
geom_node_point(size = 7) +
geom_node_text(aes(label = name), size = 7, repel = TRUE)
Error in ggraph(.): could not find function "ggraph"
ggraphInterestingly in the first year of the pandemic, from the sequenced genome data, infections within Canada appear to originate mostly from within North America but there are some infections that originated from Asia, and entered through Ontario and British Columbia - our two main global ports of entry!
We’ve really only covered linear graph layouts in our data but there
are actually many kinds of graphs that this package can
produce. The layout parameter is the key to exploring all
of the graph types available. Of course your data layout all has to make
sense! You’ll find great examples from data-imaginist.com
like this circlepack graph:
There are all kinds of network graphs that can use additional layers of metadata to shape and present your data
When working with data sometimes you may be working with sequence-specific data for analysis of motifs or you may have a large set of data describing genomic markers like single nucleotide variants. Two popular visualizations in this domain are the sequence logo and Manhattan plot.
ggseqlogoThis one is short and simple. If you have a series of sequences you’d
like to plot based on frequency of each bases you see at each position,
the sequence logo is for you. The ggseqlogo package offers
a simple ggplot2-compatible set of geoms you can add to
your graph in order to represent this data.
Your data can come in two flavours:
A named list where each index position is a vector of sequences, and the names of each index can have meaning (like a transcription factor).
A frequency matrix where columns correspond to positions, and rownames correspond to bases (or amino acids). Each entry is the number of times that a base is seen at a specific position. These matrices could be daisy-chained in a list.
Let’s work with the former since it does have some flexibility and looks like something we recognize.
# Build an example sequence set (it's actually from the SARS-CoV-2 spike sequence )
example_seq1 <- c("aacccactaatggtgttggtt","aatccactaatggtgttgctt","aacccaataatagtgttggtt",
"aacccactaatggtgttggtt","aacccactaatggtattgatt","accccactaatgttgttggtt",
"aacccactaatggtgttggtt","aacccactaatggtattgatt","accccactaatgttgttggtt") %>%
# Convert it all to upper case because ggseqlogo appears to hate when we don't
str_to_upper()
example_seq2 <- c("cacccactaatggtgttggtg","aatccactaatggtgttgctt","aacccaataatagtgttggtt",
"cacccactaatggtgttggtg","aacccactaatggtattgatt","accccactaatgttgttggtt",
"cacccactaatggtgttggtg","aacccactaatggtattgatt","accccactaatgttgttggtt") %>%
# Convert it all to upper case because ggseqlogo appears to hate when we don't
str_to_upper()
# Build our list
example_seq.list = list(spike_set1 = ..., spike_set2 = ...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a look at our list
str(example_seq.list)
Error in str(example_seq.list): object 'example_seq.list' not found
ggseqlogo()
wrapper functionWe have two options to look at our sequence data. The
ggseqlogo() is a wrapper that sets up the entire plot for
us including if you’d like to facet the data. If you just want a single
logo then you pass parts of the list individually.
If you’d like multiple logos, you can decided their layout with the
ncol or nrow parameters. The other parameters
are:
facet which will accept either “wrap” or
“grid”.
scales which will accept the scale types for the
facet geom.
# Make a 2-row sequence logo
ggseqlogo(..., ...) +
theme(text = element_text(size = 20)) +
labs(title = "Facet in rows")
Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Make a 2-column sequence logo
ggseqlogo(..., ...) +
theme(text = element_text(size = 20)) +
labs(title = "Facet in columns")
Error in eval(expr, envir, enclos): '...' used in an incorrect context
geom_logo()To plot our data we can use ggplot language that we are
familiar with and add a layer to our plot with the
geom_logo() function. This layer is also supplied by the
ggseqlogo package. In order for this to function properly,
however, we need to define our parameters as:
data: the vector with of our sequence data
seq_type: sets the type of sequence like “DNA”,
“RNA” or “AA”
method: the y-scale of our logo either in “bits”
(information value) or by “probability”
These and other parameters can also be set in the
ggseqlogo() function too.
# 1. Data
ggplot() +
# 2. Aesthetics
theme_logo() +
theme(text = element_text(size = 20)) +
labs(title = "Facet in columns with geom_logo()") +
# 4. Geoms
geom_logo(data = example_seq.list, seq_type="DNA", method = "bits") +
# 6. Facet
facet_wrap(...)
Error in geom_logo(data = example_seq.list, seq_type = "DNA", method = "bits"): object 'example_seq.list' not found
The Manhattan plot is usually used in visualizing data across the genome from LOD scores to marker proportions.
Named for it’s visual similarity to the Manhattan skyline of singular towering skyscrapers scattered above low-level buildings, this plot is commonly used for visualizing genomic markers across an ordered linear axis like a series of chromosomes. The y-axis of the values can represent LOD scores or proportions of marker representation. This is frequently used to visualize GWAS or mapping data.
This is another scatterplot that we can generate directly in
ggplot with some effort/setup or we can use a package like
qqman.
First let’s look at the kind of data. Fun fact! You can read in archived/zipped data as well, although be careful, GWAS files can be quite large!
load(...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
str(..., give.attr = FALSE)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
manhattan() function to generate a
plotTo generate our Manhattan plot we can use the
manhattan() wrapper function which will require four sets
of parameter information:
chr: the column name with the chromosome information
for a given SNP.
bp: the basepair position of the SNP.
snp: the ID of the SNP usually a RefSNP ID (RSID) of
some kind.
p: the p-value for that particular SNP generated
from the case vs. control analysis. This will be converted to a \(-log10(p)\) value for the y-axis.
# manhattan(gwasResults, chr="CHR", bp="BP", snp="SNP", p="P")
GWAS_data.df %>%
# You can filter your data if you want this step to run faster
# filter(CHR %in% c(1:5)) %>%
...(., chr="CHR", bp="POS", snp="rsid", p="p.value")
Error in ...(., chr = "CHR", bp = "POS", snp = "rsid", p = "p.value"): could not find function "..."
highlight parameter to identify markers
of interestNow that we’ve plotted our Manhattan plot, we can see some horizontal lines corresponding to minimum p-values of 1.0 x 10-5 and 5.0 x 10-8 although the significance of these can depend on sample size and allele frequencies of your SNPs.
Either way, you can highlight SNP results based on a provided list.
In this case, we can generate a list ourselves by filtering on the
p.value variable.
snp_candidates <-
# Pass along our GWAS data
GWAS_data.df %>%
# Filter it for low p-values
filter(-log10(p.value) >= 5) %>%
# Select SNP candidates
pull(...) %>%
unique()
Error in unique(.): '...' used in an incorrect context
# snp_candidates
# Plot our Manhattan plot
manhattan(GWAS_data.df,
chr="CHR",
bp="POS",
snp="rsid",
p="p.value",
highlight = ...)
Error in eval(expr, envir, enclos): '...' used in an incorrect context
When preparing figures for a presentation or manuscript consider some of the following tips/tricks that I’ve accumulated over the years. Not all may apply to your work but these can certainly be helpful.
Regardless of whether or not you’ll be giving a talk or submitting a manuscript you’ll want to consider how many figures you need to tell your story.
Most major publications will have 4-6 main figures with usually an (unlimited nowadays) number of supplemental figures.
Presentations generally have 1 slide per minute of talk so plan accordingly.
While publication requirements vary from journal to journal, most have similar rules
Minimum dpi (resolution) of your images: 300dpi
Minimum font size (ie the very smallest any font should be on a sub panel): Usually 6-7pts
Maximum image size. There is usually a maximum figure size in terms of dimensions but this really can vary widely. Check with a few potential publications first to get an idea of the main size/layout (portrait vs landscape) for figures.
Keep ALL figure data organized for submission! This is the standard now and can include sequences, or any data that was used to create your figures. You can organize this in a single CSV, or worksheet-based file.
Figures should be decipherable at a glance. A successful figure’s general message can be understood without having to read through the figure legends. This is not always possible but is certainly something to strive for.
Make your font big enough to read. Unlike a manuscript, people cannot zoom in on your presentation figures
Limit panels to 2-4 at most per figure.
Forget about the title on your figure. Title the slide instead with your take-home message about the figure. Having an effective slide title tells your audience where to focus on your figure. This is especially key when you are presenting a complicated or “busy” figure.
When presenting a figure with multiple categories or ideas, reveal it piecemeal. Present the base idea (ie a control) and then reveal the next part of the panel showing your first experimental condition, then your second etc. This allows the audience to acclimate to the idea you wish to convey.
If you can’t read the text of a figure because there are just too many things, consider simplifying or removing the text. This will reduce on distracting imagery and focus your message to your audience.
Keep a script (or notebook!) for remaking all your figures!
Simplify with a few variables things like the location of your datasets!
Build functions to make figures of the same type with different datasets.
Create a master file/table to import and use these datasets and functions.
Create a specific colour-scheme for your figures
Try to use the same colour for controls versus specific mutant genotypes you may be examining or re-using frequently
Use a colourblind-friendly palette
Save your plots as SVG, PNG or TIFFs
SVGs are vectorized meaning they are files representing lines, curves, and shapes based on a coordinate system. This means they are generally resolutionless and can be blown up to any size as needed. Great for working with on presentations and simple to alter in most image editing programs. Ideal for using on graphs and ggplot figures but not microscopy or other digital images.
Avoid JPEGs as much as possible. The tend to have poor (lossy) encoding that can create resolution loss unless you save them as big high-quality JPEG versions (lossless).
With this final lecture we’ve covered the spectrum of visualizations covering the most basic of scatter- and barplots, graduating to boxplots, violin plots, and their variants. We’ve covered high-throughput data visualizations including volcano plots, heatmaps, and principal component analysis. Furthermore we looked at how simple the projection of high-dimension data can be with t-SNE and UMAP.
We finished our course today covering phylogenetic trees, network graphs, and some sequence analysis visualizations. Overall you now have the tools to wrangle data that may appear in all sorts of formats along with a better understanding of when and how to prepare some of the most common data visualizations in your burgeoning science careers.
Congratulations and good luck on your data science journey!
We have created a post-course survey you can fill out anonymously. You can use this survey as an opportunity to tell us about your experience and help shape the future offerings of this series. Please take 5-10 minutes to fill out the survey. We really appreciate your feedback!
Anonymous Google Survey found here
This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | ||
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| - Does it make good use of available packages? | ||
| - Was data prepared properly | ||
| Answers and Output | 50% | - Is output based on the correct dataset? |
| - Are groupings appropriate | ||
| - Are correct titles/axes/legends correct? | ||
| - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the markdown notebook in its native format. Submit this file to the the appropriate assignment section by 12:59 pm on the date of our next class: April 25th, 2024.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
The book of ggtree from the Yu Lab: https://yulab-smu.top/treedata-book/index.html
Chapter 10 of ggtree with amazing and complicated plots:
https://yulab-smu.top/treedata-book/chapter10.html
More information on the ggraph package: https://cran.r-project.org/web/packages/ggraph/ggraph.pdf
ggseqlogo package information: https://cran.r-project.org/web/packages/ggseqlogo/ggseqlogo.pdf
Manhattan plot tutorial: https://www.r-graph-gallery.com/101_Manhattan_plot.html
GWAS p-value threshold: https://www.nature.com/articles/s41380-020-0670-3?proof=t
More on GWAS thresholds for significance: https://www.nature.com/articles/ejhg2015269.pdf
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.